fix: `PushDownFilter` for `GROUP BY` on uppercase col names #16049

aditanase · 2025-05-14T13:12:23Z

Which issue does this PR close?

PushDownFilter does not push a predicate when the table has columns that are not all lowercase. Tried with and without enable_ident_normalization - no change. The logic inside parse_identifiers_normalized does not seem to properly detect quotes and it will lowercase the column used in the group by expression.

Here's the query I used, just for illustration:

SELECT * FROM (
    SELECT
        fm.Timestamp,
        SUM(fm.ConsumedUnits)
    FROM fm
    WHERE
        fm.Timestamp BETWEEN to_timestamp('2025-04-10') AND to_timestamp('2025-04-20')
    GROUP BY fm.Timestamp
)
WHERE Timestamp = to_timestamp('2025-04-12')

Expected query plan:

Aggregate: groupBy=[[fm.Timestamp]], aggr=[[sum(fm.ConsumedUnits)]]
  TableScan: fm projection=[ConsumedUnits, Timestamp], full_filters=[fm.Timestamp = TimestampMicrosecond(1744416000000000, Some("UTC")), fm.Timestamp >= TimestampMicrosecond(1744243200000000, Some("UTC")), fm.Timestamp <= TimestampMicrosecond(1745107200000000, Some("UTC"))]

Actual query plan:

Filter: fm.Timestamp = TimestampMicrosecond(1744416000000000, Some("UTC"))
  Aggregate: groupBy=[[fm.Timestamp]], aggr=[[sum(fm.ConsumedUnits)]]
    TableScan: fm projection=[ConsumedUnits, Timestamp], full_filters=[fm.Timestamp >= TimestampMicrosecond(1744243200000000, Some("UTC")), fm.Timestamp <= TimestampMicrosecond(1745107200000000, Some("UTC"))]

An alterate fix could use expr_to_columns to extract the columns, as in Unnest above:

let mut group_expr_columns: HashSet<Column> = HashSet::new();
for p in &agg.group_expr {
    expr_to_columns(&p, &mut group_expr_columns)?;
}

Question: should we make the same change in the Window functions branch?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Yes, from a client application.

I did not add any unit tests, none of the existing tests in this module use upper case columns. Tried to add another table/schema, but then the test was failing, I am unsure of how to control the lowercasing of column names.

Are there any user-facing changes?

aditanase · 2025-05-14T13:18:33Z

@adriangb Saw you recent commits in this area, would appreciate if you weighed in on this. Thank you! 🙌

adriangb · 2025-05-14T14:48:57Z

Hmm I've been doing a lot with the physical optimizer of the same name but haven't touched the logical optimizer. The recent changes may mean that the pushdown ends up happening regardless at the physical level but I think it's worth fixing the logical level anyway.

I don't fully understand the issue: does from_qualified_name_ignore_case do something different with quotes than from_qualified_name? I also don't see any quotes in your example.

aditanase · 2025-05-14T15:27:28Z

I don't fully understand the issue: does from_qualified_name_ignore_case do something different with quotes than from_qualified_name? I also don't see any quotes in your example.

Sorry for not being more clear. I was referring to these lines:

datafusion/datafusion/common/src/utils/mod.rs

Lines 297 to 298 in 923bfb7

    
           .map(|id| match id.quote_style { 
        
               Some(_) => id.value,

Reading it made me think that if I used quotes I might convince it to remain unchanged, but it still converts to lowercase timestamp, no matter what I tried, so I didn't include it in the code sample.

alamb · 2025-05-27T18:48:40Z

Can we please get a test for this fix so we don't break it again in the future?

alamb · 2025-06-09T13:56:43Z

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

aditanase · 2025-06-20T12:27:43Z

@alamb Thanks for waiting, added a test that would break without this change

xudong963

I also suggest adding sqllogictest based on the sql in PR summary

xudong963 · 2025-06-23T12:01:59Z

datafusion/optimizer/src/push_down_filter.rs

+
+    #[test]
+    fn filter_agg_case_insensitive() -> Result<()> {
+        let table_scan = test_table_scan_with_uppercase_columns()?;


If the table also has a column named 'a', what'll happen?

Great question, just tried this and it works as expected for both uppercase and lower case col, even if both are present in the schema at the same time. I added another test, lmk if we should keep it or it's overkill.

Also happy to see a related sqllogictest

aditanase · 2025-06-23T16:33:15Z

I also suggest adding sqllogictest based on the sql in PR summary

I'd be happy to, can you please point me at a sample patch or a good suite to add to? Last time I tried there was quite a bit of ceremony around SLT, not sure if I can get it right on first approach. Thanks!

alamb · 2025-06-25T19:28:33Z

Here are the instructions: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest

Ideally you should be able to extend one of the existing test files in https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest/test_files

aditanase · 2025-08-06T11:12:18Z

@xudong963 @alamb apologies for the delayed update, I was out on vacation for most of July.

I added 2 slt tests and squashed/rebased - should be ready for merge now

aditanase · 2025-08-06T11:14:05Z

datafusion/sqllogictest/test_files/push_down_filter.slt

+05)--------AggregateExec: mode=Partial, gby=[A@0 as A], aggr=[sum(test_uppercase_cols.B)]
+06)----------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+07)------------CoalesceBatchesExec: target_batch_size=8192
+08)--------------FilterExec: A@0 > 10


This is the key fix, seeing the FilterExec pushed right on top of the datasource exec.
Without the fix, the plan would look like this (with the filter executed AFTER the aggregation):

+ 02)--CoalesceBatchesExec: target_batch_size=8192 + 03)----FilterExec: A@0 > 10 + 04)------AggregateExec: mode=SinglePartitioned, gby=[A@0 as A], aggr=[sum(test_uppercase_cols.B)] + 05)--------DataSourceExec: partitions=1, partition_sizes=[0]

alamb

Thanks @aditanase

alamb · 2025-08-10T10:33:56Z

datafusion/optimizer/src/push_down_filter.rs

                    .iter()
-                    .map(|e| Ok(Column::from_qualified_name(e.schema_name().to_string())))
+                    .map(|e| {
+                        Ok(Column::from_qualified_name_ignore_case(


I need to study this more carefully, but this change to ignore case seems inconsistent with the rest of this file and also potentially incorrect

For example:

In for dialects where case is not important, universally ignoring the case could push down the wrong fields

For dialects where case is important, there seem to be many other places in this module that do not compare the case 🤔

So it seems:

We should be checking the case sensitivity config setting before comparing

Are there other places that should be changed to take name into account 🤔

@alamb thanks for taking a look. I agree the fix looks fishy, and would love to better understand the mechanics of case comparison. We can keep the tests and come up with a better fix that addresses your concerns.

It boils down to the ignore_case flag which is not very intuitive, as setting to true ends up keeping the UPPERCASE, while setting to false will lowercase the column name from the gby expression:
https://github.com/apache/datafusion/blob/main/datafusion/common/src/utils/mod.rs#L298-L299

The problem is introduced as we're roundtripping the correctly parsed column name

Column { relation: Some(Bare { table: "test" }), name: "A" }

through

Column::from_qualified_name(e.schema_name().to_string())

Which will lowercase it, as now there are no more double quotes around the field name. That's why I switched to Column::from_qualified_name_ignore_case, to avoid lowercasing as we're already starting from parsed/cased names.

Very open to suggestions

github-actions · 2025-10-11T02:00:18Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions bot added the optimizer Optimizer rules label May 14, 2025

aditanase changed the title ~~fix: PushDownFilter for GROUP BY on uppercase col names~~ fix: PushDownFilter for GROUP BY on uppercase col names May 14, 2025

aditanase force-pushed the fix-gby-pushdownfilter branch from 0495f24 to 14ceef8 Compare May 14, 2025 13:16

alamb marked this pull request as draft June 9, 2025 13:56

aditanase force-pushed the fix-gby-pushdownfilter branch 2 times, most recently from 4df8ec6 to ba36d39 Compare June 20, 2025 12:27

aditanase marked this pull request as ready for review June 20, 2025 12:27

xudong963 reviewed Jun 23, 2025

View reviewed changes

aditanase force-pushed the fix-gby-pushdownfilter branch from 4082863 to e335907 Compare August 6, 2025 11:10

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Aug 6, 2025

aditanase commented Aug 6, 2025

View reviewed changes

fix: PushDownFilter for GROUP BY on uppercase col names

bc38c43

aditanase force-pushed the fix-gby-pushdownfilter branch from e335907 to bc38c43 Compare August 7, 2025 10:29

alamb reviewed Aug 10, 2025

View reviewed changes

github-actions bot added the Stale PR has not had any activity for some time label Oct 11, 2025

github-actions bot closed this Oct 19, 2025

fix: PushDownFilter for GROUP BY on uppercase col names #16049

fix: PushDownFilter for GROUP BY on uppercase col names #16049

Uh oh!

Conversation

aditanase commented May 14, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

aditanase commented May 14, 2025

Uh oh!

adriangb commented May 14, 2025

Uh oh!

aditanase commented May 14, 2025

Uh oh!

alamb commented May 27, 2025

Uh oh!

alamb commented Jun 9, 2025

Uh oh!

aditanase commented Jun 20, 2025

Uh oh!

xudong963 left a comment

Choose a reason for hiding this comment

Uh oh!

xudong963 Jun 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aditanase Jun 23, 2025

Choose a reason for hiding this comment

Uh oh!

xudong963 Jun 26, 2025

Choose a reason for hiding this comment

Uh oh!

aditanase commented Jun 23, 2025

Uh oh!

alamb commented Jun 25, 2025

Uh oh!

aditanase commented Aug 6, 2025

Uh oh!

aditanase Aug 6, 2025

Choose a reason for hiding this comment

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

alamb Aug 10, 2025

Choose a reason for hiding this comment

Uh oh!

aditanase Aug 11, 2025

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Oct 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

fix: `PushDownFilter` for `GROUP BY` on uppercase col names #16049

fix: `PushDownFilter` for `GROUP BY` on uppercase col names #16049

xudong963 Jun 23, 2025 •

edited

Loading